Clone an EMR Cluster
This section will teach you how to re-launch an EMR cluster using previously established settings.
Log into AWS Academy and launch “AWS Lab” to get to the AWS Console. See the AWS Academy Learner Lab Student Guide in the Guides folder if you need a refresher. While you are in AWS Academy, take note of how much credit you have used in your account so far. If you run out of credits, let a member of the instructional team know.
Search in the AWS Console for emr and click on the EMR service as shown in the figure below.
- Now that you are in the EMR Dashboard, click on the box for the cluster that you previously made. This may have been a Hue Cluster or a Spark Cluster. Once you click on that row, there will be a blue box (left yellow arrow). Then click on the Clone button (top yellow arrow).
- Click on the blue Clone button. You can leave the steps option as its default since there have been no steps applied to clusters in this class.
Step 4: Security
Select your correct EC2 key pair. Use the key pair you created in lab 1, which should be called mykeypair. If you do not select the appropriate key pair, you will not be able to connect to the cluster.
Leave Permissions as is.
Leave Security Configuration alone.
In the EC2 security groups, you need to change your EMR managed security groups to be the security group we created. If you used the same name as the instructions, it will be called open-22. Select this group for both the Master and Core & Task rows.
Click the blue Create cluster button to launch your cluster!
Cluster Startup
Once you click on Create Cluster, you will be taken to the cluster summary page where you will see all relevant information about the cluster as shown in the figure below.
The cluster will go through several states until it is ready for you to use.
Starting
Running
Waiting
Cluster startup time can be 5-15 minutes or more! You must wait until the cluster is in Waiting state before you connect to it.
Jump to SSH Section
Start an EMR Cluster
This section will teach you how to launch an EMR (Elastic MapReduce) cluster for PySpark jobs. A cluster means you will have several different machines working together in coordination. There will be one master node that coordinates running the jobs you ask the cluster to do and worker nodes to execute the work. EMR clusters in industry can have dozens, hundreds, or even thousands of CPU cores! We will be harnessing programming languages that are designed to work on clusters. As a Data Scientist, you can focus on writing code for data analysis and the cluster will handle setting up the multiple machines for you.
There are several changes when starting this cluster compared to the Hue cluster:
- You will be using different applications in step 1
- You will add an S3 path in the Edit Software Settings section of step 1
- You will add a bootstrap action in step 3
- You will use port 8765 when ssh’ing to the cluster
You should already have an SSH keypair created and security group set up. If you have not accomplished these yet, go back to lab 1 to do so.
It is time to launch an Elastic MapReduce (EMR) Cluster! Follow these steps closely!
Log into AWS Academy and launch “AWS Lab” to get to the AWS Console. See the AWS Academy Learner Lab Student Guide in the Guides folder if you need a refresher. While you are in AWS Academy, take note of how much credit you have used in your account so far. If you run out of credits, let a member of the instructional team know.
Search in the AWS Console for emr and click on the EMR service as shown in the figure below.
- Now that you are in the EMR Dashboard, click on the blue Create cluster button as shown with the yellow arrow in the figure below.
- Now you will be on the Create Cluster - Quick Options page. You will need to click the Go to advanced options button as shown with the yellow arrow in the figure below.
You will now complete the four step pages to create your own EMR cluster!
Step 1: Software and Steps
This step will let you choose your software and whether you want specific jobs to run when the cluster starts. In our class we never want to run any steps.
Software Configuration section
Use EMR version emr-6.1.0 from the Release drop down
- NOTE THAT THIS SETTING WILL CHANGE BASED ON YOUR NEEDS
Confirm that the following are checked: Hadoop 3.2.1, Spark 3.0.0
Leave the check box in Multiple master nodes (optional) unchecked. This would be useful if you are a cloud architect who wants to set up an EMR cluster that will be running for weeks or months with multiple users.
Leave AWS Glue Data Catalog settings (optional) unchecked. This would be useful if you are a data engineer or architect and have built cloud databases or work in an organization with cloud databases.
- NOTE THAT THIS SETTING CHANGE IS NEW FOR SPARK CLUSTERS
In the Edit software settings section, select the radio button Load Json from S3 and enter this path into the text box: s3://bigdatateaching/bootstrap/cluster-config.json, which will establish several important settings for your Spark cluster.
Steps (optional)
Leave all of this section alone. We do not have any automated jobs we need to run so we’re not using this at all.
Click the blue Next button to head to Step 2. The appropriate options are included in the figure below.
Step 2: Hardware
In this step you will apply the following settings so that your screen looks like the figure below.
Leave the Cluster Composition section alone; we want Uniform instance groups. You could have Instance Fleets if you were a cloud architect building a cluster for a company.
Leave all the defaults for the Networking section. These settings would be changed if you were setting up a virtual private cloud (VPC) for a company. It is common for workplaces to have their own VPCs. Do not worry, you will not have to set up any VPCs. Most of the time these are already configured by cloud architects so data scientists can launch EMR clusters the way we are doing here.
Cluster Nodes and Instances section.
There is a table of instance groups. There are three types: Master, Core, and Task.
- The Master instance will coordinate your cluster activities and send jobs to the core and task nodes.
- The core nodes will store data for your cluster and compute jobs that are received from the master node.
- The task nodes are used for compute jobs and do not store data as part of the distributed file system. In this class, we will not be using task nodes.
In the first row of the table, you will keep the instance type as m5.xlarge.
- If you did want to change the instance type, then you would click on the tiny pencil next to the instance name, then scroll through the list of instances to select m5.xlarge and click Save.
In the second row of the table (core nodes), keep the same instance type so that you are using m5.xlarge. Follow the previous bullet for changing instance type if necessary.
Change the instance count to use more core nodes. We will use up to 7 core nodes. There should be 1 master node, 7 core nodes, and 0 task nodes.
- AWS Academy limits you to using 32 cores at once. Assuming you are not running anything else on AWS right now, you can launch 8 4-core nodes for your EMR cluster (1 master node plus 7 core nodes).
Leave Cluster scaling unchecked. This option could be useful when you are working and want to get more resources dynamically. But watch out, more resources also means more money!
Enable Auto-termination, and leave the idle time at 1 hour and 0 minutes. This will serve as a backup kill switch in case you forget to turn off your EMR cluster. It is always your personal responsibility to turn off your EMR cluster after you are done with your work! You will use all your credits in less than a week if you forget to turn off your EMR cluster.
Keep the defaults for EBS root volume.
Click the blue Next button at the bottom of the page to go to step 3.
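The node-count arithmetic from the Hardware step can be checked with a short sketch. It assumes the 32-core AWS Academy limit stated above and 4 vCPUs per m5.xlarge node; the function name and the `vcpus_in_use` parameter are illustrative and not part of the lab.

```python
# Sketch: how many core nodes fit under a vCPU limit once the master node is counted.
VCPU_LIMIT = 32       # AWS Academy limit stated in this lab
VCPUS_PER_NODE = 4    # an m5.xlarge node has 4 vCPUs
MASTER_NODES = 1      # this lab always uses a single master node


def max_core_nodes(vcpu_limit=VCPU_LIMIT, vcpus_per_node=VCPUS_PER_NODE,
                   master_nodes=MASTER_NODES, vcpus_in_use=0):
    """Return the largest core-node count that stays within the vCPU limit."""
    total_nodes = (vcpu_limit - vcpus_in_use) // vcpus_per_node
    # The master node takes one slot; never return a negative count.
    return max(total_nodes - master_nodes, 0)


print(max_core_nodes())  # 7
```

If something else in your account is already using vCPUs, pass that number as `vcpus_in_use` to see how many core nodes are still available.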
Step 3: General Cluster Settings
General Options section
Give the cluster a name that is meaningful. Call the cluster Spark Cluster EMR 6.1.0
Leave the other settings here alone. Logging will be useful if your cluster crashes. You can leave the default S3 bucket to store your logs. We do not need to encrypt our logs. If you were working with confidential or classified data then you should encrypt your logs! Leave Debugging and Termination protection checked.
Leave the Tags section alone
Open the Additional Options section
NOTE THAT THERE IS A NEW STEP HERE FOR SPARK CLUSTERS
- Go to the Add bootstrap action dropdown and select Custom action from the dropdown as shown in the figure below, and click on the Configure and add button.
- In the Add Bootstrap Action dialog box, enter the following location in the Script location section:
s3://bigdatateaching/bootstrap/bigdata-bootstrap_emr6.sh
Here is a summary of what this script does:
Installs Miniconda and Python3 on every node of the cluster, with many additional Python libraries
Installs and starts JupyterLab automatically on port 8765, and you can use it for many repositories
Installs git
Tells YARN to allocate the most possible resources to Spark
Make sure you see the custom action on the screen, then click the blue Next button to go to step 4.
Step 4: Security
Select your correct EC2 key pair. Use the key pair you created in lab 1, which should be called mykeypair. If you do not select the appropriate key pair, you will not be able to connect to the cluster.
Leave Permissions as is.
Leave Security Configuration alone.
In the EC2 security groups, you need to change your EMR managed security groups to be the security group we created. If you used the same name as the instructions, it will be called open-22. Select this group for both the Master and Core & Task rows.
Click the blue Create cluster button to launch your cluster!
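For reference, the console choices made in the four steps can be collected into one request, roughly as they would be passed to boto3's EMR `run_job_flow` call. This is only a sketch and not the method used in this lab. The security group ID is a hypothetical placeholder (the API takes the ID of open-22, not its name), the two role names are assumed account defaults, and the cluster-config.json settings loaded from S3 are not represented here.

```python
# Sketch: the console selections expressed as keyword arguments for
# boto3's EMR run_job_flow call. Nothing here contacts AWS.
BOOTSTRAP_SCRIPT = "s3://bigdatateaching/bootstrap/bigdata-bootstrap_emr6.sh"


def build_cluster_request(security_group_id, core_node_count=7):
    """Return a dict of run_job_flow arguments mirroring the lab settings."""
    return {
        "Name": "Spark Cluster EMR 6.1.0",
        "ReleaseLabel": "emr-6.1.0",
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",          # core nodes
            "InstanceCount": 1 + core_node_count,      # 1 master + core nodes
            "Ec2KeyName": "mykeypair",
            "KeepJobFlowAliveWhenNoSteps": True,       # no steps, stay in Waiting
            "TerminationProtected": True,
            "EmrManagedMasterSecurityGroup": security_group_id,
            "EmrManagedSlaveSecurityGroup": security_group_id,
        },
        "BootstrapActions": [
            {
                "Name": "bigdata bootstrap",           # illustrative label
                "ScriptBootstrapAction": {"Path": BOOTSTRAP_SCRIPT, "Args": []},
            }
        ],
        "AutoTerminationPolicy": {"IdleTimeout": 3600},  # 1 hour, in seconds
        "JobFlowRole": "EMR_EC2_DefaultRole",          # assumed default role name
        "ServiceRole": "EMR_DefaultRole",              # assumed default role name
    }


# "sg-0123456789abcdef0" is a made-up ID standing in for the open-22 group.
request = build_cluster_request("sg-0123456789abcdef0")
print(request["Instances"]["InstanceCount"])  # 8
```

With a configured boto3 client, such a dict could be passed as `emr.run_job_flow(**request)`; whether the assumed role names exist in an AWS Academy account would need to be checked first.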
Cluster Startup
Once you click on Create Cluster, you will be taken to the cluster summary page where you will see all relevant information about the cluster as shown in the figure below.
The cluster will go through several states until it is ready for you to use.
Starting
Running
Waiting
Cluster startup time can be 5-15 minutes or more! You must wait until the cluster is in Waiting state before you connect to it.
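The "wait until the cluster is in Waiting state" rule can also be expressed as a small polling loop. This is a minimal sketch: `get_state` is a hypothetical callable that returns the current state as a string, and the uppercase state names follow the EMR API rather than the console labels shown above.

```python
import time

READY_STATE = "WAITING"
# States from which the cluster will never become ready.
FAILED_STATES = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_until_ready(get_state, poll_seconds=30, timeout_seconds=1800,
                     sleep=time.sleep):
    """Poll get_state() until it returns WAITING, a failure state, or we time out."""
    waited = 0
    while True:
        state = get_state()
        if state == READY_STATE:
            return state
        if state in FAILED_STATES:
            raise RuntimeError(f"Cluster stopped before it was ready: {state}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Cluster still in state {state} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

With boto3, `get_state` could be something like `lambda: emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]`, where `emr` is an EMR client and `cluster_id` is your cluster's ID; both names are assumptions for illustration.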
SSH Into and Use the Cluster
Go to the EMR console and click on the cluster of interest which takes you to the cluster’s summary page
Copy the Master Public DNS from the Summary Section. In the figure below that looks like this ecX-XXX-XXX-XXX.compute-1.amazonaws.com. Yours will be different.
Open the same terminal on your local laptop that you used in the terminal setup section. Add your private key to memory using ssh-add (use the right approach based on your operating system.)
Run the command
ssh -A -L 8765:localhost:8765 hadoop@[[YOUR MASTER NODE DNS ADDRESS]]
Note the username is hadoop. Get your cluster's master node address (the Master Public DNS) from the cluster console.
- Use 8765 for the port number.
- The port number depends on the type of cluster you are using. For a Spark cluster, use 8765. We will be using Spark later on in the semester. For a Hue cluster, use 8888. All the communications coming from and going to your AWS machine are going through the port number that you specify. We use standard ports to make things simple for the applications to communicate with you. Other common port numbers are 443 for secure websites, and 80 for non-secure websites.
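The port rule above (8765 for a Spark cluster, 8888 for a Hue cluster) can be captured in a small helper that builds the ssh command string. This is an illustrative sketch; the function and dictionary names are not part of the lab, and the DNS name is the one from the sample output in this section.

```python
# Local/remote port to forward for each cluster type used in this class.
PORTS = {"spark": 8765, "hue": 8888}


def ssh_tunnel_command(master_dns, cluster_type="spark"):
    """Build the ssh command that forwards the cluster's web port to localhost."""
    port = PORTS[cluster_type]
    return f"ssh -A -L {port}:localhost:{port} hadoop@{master_dns}"


print(ssh_tunnel_command("ec2-54-211-241-107.compute-1.amazonaws.com"))
```

Replace the DNS name with your own cluster's Master Public DNS before running the printed command in your terminal.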
Your command and login message output will look something like the following:
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.
Try the new cross-platform PowerShell https://aka.ms/pscore6
PS C:\Users\Monke> ssh-agent
PS C:\Users\Monke> ssh-add
Identity added: C:\Users\Monke/.ssh/id_rsa (C:\Users\Monke/.ssh/id_rsa)
PS C:\Users\Monke> ssh -L 8765:localhost:8765 hadoop@ec2-54-211-241-107.compute-1.amazonaws.com
Last login: Thu Oct 21 03:05:27 2021 from pool-108-31-220-48.washdc.fios.verizon.net
__| __|_ )
_| ( / Amazon Linux 2 AMI
___|\___|___|
https://aws.amazon.com/amazon-linux-2/
13 package(s) needed for security, out of 44 available
Run "sudo yum update" to apply all updates.
EEEEEEEEEEEEEEEEEEEE MMMMMMMM MMMMMMMM RRRRRRRRRRRRRRR
E::::::::::::::::::E M:::::::M M:::::::M R::::::::::::::R
EE:::::EEEEEEEEE:::E M::::::::M M::::::::M R:::::RRRRRR:::::R
E::::E EEEEE M:::::::::M M:::::::::M RR::::R R::::R
E::::E M::::::M:::M M:::M::::::M R:::R R::::R
E:::::EEEEEEEEEE M:::::M M:::M M:::M M:::::M R:::RRRRRR:::::R
E::::::::::::::E M:::::M M:::M:::M M:::::M R:::::::::::RR
E:::::EEEEEEEEEE M:::::M M:::::M M:::::M R:::RRRRRR::::R
E::::E M:::::M M:::M M:::::M R:::R R::::R
E::::E EEEEE M:::::M MMM M:::::M R:::R R::::R
EE:::::EEEEEEEE::::E M:::::M M:::::M R:::R R::::R
E::::::::::::::::::E M:::::M M:::::M RR::::R R::::R
EEEEEEEEEEEEEEEEEEEE MMMMMMM MMMMMMM RRRRRRR RRRRRR
[hadoop@ip-172-31-83-145 ~]$
- Open a browser and navigate to http://localhost:8765 to see your Jupyter Lab environment. Your environment will look like the figure below.
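If the browser cannot reach the page, one quick check is whether anything is listening on the local end of the tunnel. The sketch below only tests that a TCP connection to the port succeeds; it does not confirm that JupyterLab itself is responding on the cluster. The function name is illustrative.

```python
import socket


def port_is_open(host="localhost", port=8765, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

Calling `port_is_open()` in a second terminal while the ssh session is running should return True; False suggests the ssh command was run without the `-L 8765:localhost:8765` option or the session has closed.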
TERMINATE YOUR EMR CLUSTER!
Go to the cluster’s summary page where you grabbed the Master DNS Address. Click on the gray Terminate button.
Then you will get a popup window about shutting down your instance. If the Terminate button is red like the figure below, click it.
If the button is grayed out like below, you will need to turn off cluster termination protection. To do so, click the Change link.
Then choose the off radio button, then click the green check mark symbol. Next click the red Terminate button.
Once you have successfully terminated the cluster, you will see the yellow terminating text on the cluster’s summary page. That means you can close the page and the cluster will terminate.
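The two console actions above (turn off termination protection, then terminate) correspond to two EMR API calls. The sketch below assumes `emr_client` is a boto3 EMR client and `cluster_id` is your cluster's ID; the function name and both parameters are illustrative, and this is not a replacement for checking the console yourself.

```python
def terminate_cluster(emr_client, cluster_id):
    """Turn off termination protection, then request termination of the cluster."""
    # Termination is rejected while protection is on, so switch it off first.
    emr_client.set_termination_protection(
        JobFlowIds=[cluster_id],
        TerminationProtected=False,
    )
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
```

After calling it, the cluster's summary page should show the Terminating state, the same as after clicking the Terminate button.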